Lab for an IoT course at Télécom Paris. Case study provided by Sigfox.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from geopy.distance import vincenty
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
import warnings
warnings.filterwarnings("ignore")
# load train and test data
df_mess_train = pd.read_csv('./data/mess_train_list.csv') # train set
pos_train = pd.read_csv('./data/pos_train_list.csv') # position associated to train set
print("Number of observations in messages: ", df_mess_train.shape[0])
print("Number of observations in positions:", pos_train.shape[0])
df_mess_train.head()
df_mess_train.describe()
df_mess_train.isnull().sum(axis = 0)
df_mess_train.isna().sum(axis = 0)
There are no null or NaN values.
pos_train.head()
pos_train.describe()
pos_train.isna().sum(axis = 0)
pos_train.isnull().sum(axis = 0)
There are no null or NaN values.
To simplify visualization and the construction of our feature matrix, we join the positions table (and thus the labels) onto the current features table.
df_mess_train[['pos_lat', 'pos_lng']] = pos_train
df_mess_train.head()
df_mess_train.describe()
We count the unique values:
print("Number of unique messages: %d" %df_mess_train.messid.nunique())
print("Number of unique stations: %d" %df_mess_train.bsid.nunique())
print("Number of unique devices: %d" %df_mess_train.did.nunique())
print("Number of unique values of time_ux: %d" %df_mess_train.time_ux.nunique())
print("Number of unique values of rssi: %d" %df_mess_train.rssi.nunique())
print("Number of unique values of nseq: %d" %df_mess_train.nseq.nunique())
print("Number of unique values of bs_lat: %d" %df_mess_train.bs_lat.nunique())
print("Number of unique values of bs_lng: %d" %df_mess_train.bs_lng.nunique())
print("Number of unique values of pos_lat: %d" %pos_train.lat.nunique())
print("Number of unique values of pos_lng: %d" %pos_train.lng.nunique())
We notice that:
- there are fewer unique messages than rows, because each message is received by several stations;
- there are more unique values of time_ux and pos_lat than there are unique messages, which implies that we will have to aggregate (e.g. average) the per-station values, for instance bs_lat and bs_lng, for each message.

We will look at the distribution of values for some variables in our training and validation set.
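The per-message aggregation described above can be sketched on toy data (made-up message ids and coordinates, not the real dataset):

```python
import pandas as pd

# Toy receptions: one message heard by several base stations (made-up values)
toy = pd.DataFrame({
    "messid": ["m1", "m1", "m2"],
    "bs_lat": [48.85, 48.87, 43.60],
    "bs_lng": [2.35, 2.31, 1.44],
})

# Collapse to one row per message by averaging the station coordinates
agg = toy.groupby("messid")[["bs_lat", "bs_lng"]].mean()
print(agg)
```

Message m1, received twice, collapses to a single row whose coordinates are the mean of its two receptions.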
nseq
plt.figure(figsize=(7,4))
plt.hist(df_mess_train.nseq, color='red')
plt.title("Distribution of values for nseq")
plt.show()
This variable takes only 5 different values between 0 and 2, symmetrically distributed around 1.
rssi
plt.figure(figsize=(7,4))
sns.distplot(df_mess_train.rssi, bins=200, color='red')
plt.title("Distribution of values for rssi")
plt.show()
The values are distributed between -140 and -100, most of them around -130.
time_ux
plt.figure(figsize=(7,4))
sns.distplot(df_mess_train.time_ux, bins=100, kde=False, color='red')
plt.title("Distribution of values for time_ux")
plt.show()
The data are evenly distributed over time, with a significant peak towards the end.
bsid
plt.figure(figsize=(7,4))
sns.distplot(df_mess_train.bsid, bins=100, kde=False, color='red')
plt.title("Distribution of the number of messages by bsid")
plt.show()
We note that some stations receive many more messages than others. This should be taken into account when analyzing the poorly represented categories.
did
plt.figure(figsize=(7,4))
abcisse = list(range(df_mess_train.did.nunique()))
plt.bar(abcisse, list(df_mess_train.groupby(['did']).messid.nunique()), width=1, color='red')
plt.title("Distribution of number of messages by device id")
plt.show()
We note that the values of the device ids are also very different: some devices emit many more messages than others. The distribution is quite similar to that of the station ids (bsid), because some devices stay near the same stations when they emit. We will also have to look at the number of messages per device id and analyze the least represented devices.
plt.figure(figsize=(7,5), dpi=80)
sns.heatmap(df_mess_train.corr(), cmap='PuOr', center=0, annot=True)
plt.title('Correlation matrix', fontsize=12)
plt.show()
We note that there are few significant correlations between variables, with the exception of station longitude and latitude.
plt.figure(figsize=(7,4))
plt.plot(df_mess_train['bs_lat'], df_mess_train['bs_lng'], 'x', color='red')
plt.title('Position of stations')
plt.xlabel('Base Station Lat')
plt.ylabel('Base Station Long')
plt.show()
import plotly.express as px
import plotly.graph_objects as go
#fig = px.scatter_mapbox(df_mess_train, lat="bs_lat", lon="bs_lng", color_discrete_sequence=["fuchsia"], name="Stations", zoom=1.8, height=400)
fig = go.Figure()
fig.add_trace(go.Scattermapbox(lat=df_mess_train["bs_lat"], lon=df_mess_train["bs_lng"], name="Stations",
marker=go.scattermapbox.Marker(size=5,color='red',opacity=0.7,)))
fig.add_trace(go.Scattermapbox(lat=df_mess_train["pos_lat"], lon=df_mess_train["pos_lng"], name="Messages",
marker=go.scattermapbox.Marker(size=4,color='blue',opacity=0.7,)))
fig.update_layout(mapbox_style="open-street-map", height=400)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
Some stations are far away from the others. These stations are located in the north of Canada, at a latitude of 65 and a longitude of -70.
All the messages are located in the same area, and surrounded by stations.
We will see if the distant stations have received messages:
df_messages_distant_stations = df_mess_train[(df_mess_train["bs_lat"] > 42) & (df_mess_train["bs_lng"] > -104)]
fig = go.Figure()
fig.add_trace(go.Scattermapbox(lat=df_messages_distant_stations["bs_lat"], lon=df_messages_distant_stations["bs_lng"], name="Stations",
marker=go.scattermapbox.Marker(size=5,color='red',opacity=0.7,)))
fig.add_trace(go.Scattermapbox(lat=df_messages_distant_stations["pos_lat"], lon=df_messages_distant_stations["pos_lng"], name="Messages",
marker=go.scattermapbox.Marker(size=5,color='blue',opacity=0.7,)))
fig.update_layout(mapbox_style="open-street-map", height=400)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
It is therefore clear that no message was emitted near latitude 65 and longitude -70. We can consider these stations as outliers: their recorded latitudes and longitudes are erroneous, and a strategy to deal with these anomalies has to be defined.
We list these outlier stations:
list_station_outliers = df_messages_distant_stations["bsid"].unique()
list_devices_outliers = df_messages_distant_stations["did"].unique()
list_messages_outliers = df_messages_distant_stations["messid"].unique()
messages_to_outliers = df_mess_train[df_mess_train['bsid'].isin(list_station_outliers)].groupby('bsid').count()
print("Number of outlier stations: ", len(list_station_outliers))
print("Number of devices sending messages to outlier stations:", len(list_devices_outliers))
print("Number of messages received by outlier stations:", messages_to_outliers['messid'].sum())
print("They represent {:.2f}% of all received messages".format(100*messages_to_outliers['messid'].sum() / df_mess_train.shape[0]))
The messages received by the outlier stations represent a non-negligible share of the dataset.
We will check if the concerned devices have sent messages to other stations:
df_mess_to_outliers = df_mess_train[df_mess_train['messid'].isin(list_messages_outliers)]
fig = go.Figure()
fig.add_trace(go.Scattermapbox(lat=df_mess_to_outliers["bs_lat"], lon=df_mess_to_outliers["bs_lng"], name="Stations",
marker=go.scattermapbox.Marker(size=6,color='red',opacity=0.7,)))
fig.add_trace(go.Scattermapbox(lat=df_mess_to_outliers["pos_lat"], lon=df_mess_to_outliers["pos_lng"], name="Messages",
marker=go.scattermapbox.Marker(size=6,color='blue',opacity=0.7,)))
fig.update_layout(mapbox_style="open-street-map", height=400)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
We see that, while a significant number of messages involve the outlier stations, those same messages were also received by many other, correctly located stations, which can cover their reception.
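The coverage claim can be checked per message: for each message that reached an outlier station, count how many correctly located stations also received it. A sketch on toy data (station "bad" stands in for an outlier station; the real objects are df_mess_to_outliers and list_station_outliers above):

```python
import pandas as pd

# Toy receptions of two messages, each also heard by an outlier station "bad"
toy = pd.DataFrame({
    "messid": ["m1", "m1", "m1", "m2", "m2"],
    "bsid":   ["bad", "ok1", "ok2", "bad", "ok1"],
})
outliers = {"bad"}

# Number of non-outlier stations that received each outlier-affected message
good_cover = (toy[~toy["bsid"].isin(outliers)]
              .groupby("messid")["bsid"].nunique())
print(good_cover.min())  # worst-covered message
```

If the minimum is at least 1, every outlier-affected message is still covered by a correctly located station.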
So we define 3 options:
1. delete the receptions involving outlier stations;
2. replace the outlier coordinates by averaging over comparable receptions;
3. relocate the outlier stations with a random forest.

The 3 options are implemented and the final models are tested on the 3 resulting datasets. A table at the end summarizes the scores obtained with each dataset. For readability, we keep OPTION 3 as the default.
OPTION = 3
if OPTION == 1:
    index_to_remove1 = df_mess_train[df_mess_train['bsid'].isin(list_station_outliers)].index.values
    df_mess_train = df_mess_train.drop(index_to_remove1)
df_mess_train.shape
OPTION 2: we replace the outlier coordinates using the other receptions of the same message and their rssi.
if OPTION == 2:
    for msg in list_messages_outliers:
        df_bymessage = df_mess_train[df_mess_train['messid'] == msg]
        index = df_bymessage.index[(df_bymessage["bs_lat"] > 42) & (df_bymessage["bs_lng"] > -104)].tolist()
        for idx in index:
            # We take the mean of the 3 stations with the closest rssi (excluding the point itself)
            value = df_bymessage.loc[idx, 'rssi']
            tocompare = df_bymessage.iloc[(df_bymessage['rssi'] - value).abs().argsort()[1:4]]
            index_tocompare = tocompare.index[(tocompare["bs_lat"] < 42) | (tocompare["bs_lng"] < -104)].tolist()
            if len(index_tocompare) > 0:
                v_lat = tocompare.loc[index_tocompare, 'bs_lat'].mean()
                v_lng = tocompare.loc[index_tocompare, 'bs_lng'].mean()
                df_mess_train.loc[idx, 'bs_lat'] = v_lat
                df_mess_train.loc[idx, 'bs_lng'] = v_lng
            else:
                df_mess_train = df_mess_train.drop(idx)
OPTION 3: we implement another way to replace the points, using a random forest.
listOfMessIds = df_mess_to_outliers["messid"].unique()
listOfMessIds = ["mess_"+str(code) for code in listOfMessIds]
print("Number of unique messages : ", len(listOfMessIds))
ohe = OneHotEncoder()
X_messid = ohe.fit_transform(df_mess_to_outliers[['messid']]).toarray()
# OneHotEncoder sorts its categories, so derive the column names from it,
# and keep the original index so the assignment below aligns row by row
listOfMessIds = ["mess_" + str(c) for c in ohe.categories_[0]]
df_messid_train = pd.DataFrame(X_messid, columns=listOfMessIds, index=df_mess_to_outliers.index)
df_mess_to_outliers[listOfMessIds] = df_messid_train
df_mess_to_outliers.fillna(0, inplace=True)
train = df_mess_to_outliers[~df_mess_to_outliers["bsid"].isin(list_station_outliers)]
pred = df_mess_to_outliers[df_mess_to_outliers["bsid"].isin(list_station_outliers)]
print(train.shape, pred.shape)
## Train set: the messages received by stations correctly located.
y_lat = train['bs_lat']
y_lng = train['bs_lng']
X = train.drop(["bs_lat","bs_lng", "messid"], axis=1)
X_train, X_test, y_lat_train, y_lat_test, y_lng_train, y_lng_test = train_test_split(X, y_lat, y_lng, test_size = 0.2, random_state=261)
## Pred set: the messages received by outlier stations
X_pred = pred.drop(["bs_lat","bs_lng", "messid"], axis=1)
clf_lat = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=261)
clf_lat.fit(X_train, y_lat_train)
clf_lng = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=261)
clf_lng.fit(X_train, y_lng_train)
score_lat = clf_lat.score(X_test, y_lat_test)
score_lng = clf_lng.score(X_test, y_lng_test)
print(score_lat)
print(score_lng)
X_pred["bs_lat_new"] = clf_lat.predict(X_pred)
X_pred["bs_lng_new"] = clf_lng.predict(X_pred)
X_pred.to_csv("./data/station-relocated.csv")
X_pred = pd.read_csv("./data/station-relocated.csv")
X_pred.set_index("Unnamed: 0", inplace=True)
X_pred.index.name = None
X_pred
df_mess_train = df_mess_train.merge(X_pred, how="left", left_index=True, right_index=True)
df_mess_train["bs_lat"] = df_mess_train.apply(lambda x: x["bs_lat_new"] if not pd.isna(x["bs_lat_new"]) else x["bs_lat"], axis=1)
df_mess_train["bs_lng"] = df_mess_train.apply(lambda x: x["bs_lng_new"] if not pd.isna(x["bs_lng_new"]) else x["bs_lng"], axis=1)
df_mess_train.drop(["bs_lat_new", "bs_lng_new"], axis=1, inplace=True)
df_mess_train.tail(5)
Here is the result of the relocated stations (in green on the map):
fig = go.Figure()
mask = df_mess_train["bsid"].isin(list_station_outliers)
fig.add_trace(go.Scattermapbox(lat=df_mess_train[~mask]["bs_lat"], lon=df_mess_train[~mask]["bs_lng"], name="Stations",
marker=go.scattermapbox.Marker(size=7,color='red',opacity=0.7,)))
fig.add_trace(go.Scattermapbox(lat=df_mess_train[mask]["bs_lat"], lon=df_mess_train[mask]["bs_lng"], name="Stations relocated",
marker=go.scattermapbox.Marker(size=7,color='green',opacity=0.7,)))
fig.add_trace(go.Scattermapbox(lat=df_mess_train["pos_lat"], lon=df_mess_train["pos_lng"], name="Messages",
marker=go.scattermapbox.Marker(size=7,color='blue',opacity=0.7,)))
fig.update_layout(mapbox_style="open-street-map", height=400)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
We isolate one message, and display the location of all the stations that have received it. The relocated station (in green) was previously mislocated in the north of Canada, and is now in the downtown area of Denver:
df_mess_to_outliers4 = df_mess_train[df_mess_train["messid"] == "57617e1ef0fe6e0c9fd6eb06"]
df_mess_to_outliers4
fig = go.Figure()
mask = df_mess_to_outliers4["bsid"].isin(list_station_outliers)
fig.add_trace(go.Scattermapbox(lat=df_mess_to_outliers4[~mask]["bs_lat"], lon=df_mess_to_outliers4[~mask]["bs_lng"], name="Stations",
marker=go.scattermapbox.Marker(size=7,color='red',opacity=0.7,)))
fig.add_trace(go.Scattermapbox(lat=df_mess_to_outliers4[mask]["bs_lat"], lon=df_mess_to_outliers4[mask]["bs_lng"], name="Stations relocated",
marker=go.scattermapbox.Marker(size=7,color='green',opacity=0.7,)))
fig.add_trace(go.Scattermapbox(lat=df_mess_to_outliers4["pos_lat"], lon=df_mess_to_outliers4["pos_lng"], name="Messages",
marker=go.scattermapbox.Marker(size=7,color='blue',opacity=0.7,)))
fig.update_layout(mapbox_style="open-street-map", height=400)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
We will try to find the stations that process few messages to potentially remove them from the training set in order to keep only the most representative stations and thus have the most reliable categories to make predictions.
# We look at base stations that do not receive many messages
count_basestation = df_mess_train.groupby('bsid').count()
count_basestation = count_basestation['messid']
mes_limit = 500 # Limit
plt.figure(figsize=(8,6))
count_basestation_cum = count_basestation.sort_values(ascending=True).cumsum()
plt.plot(count_basestation_cum.values)
x = [0, count_basestation_cum.count()]
y = [mes_limit, mes_limit]
plt.plot(x, y, color ='black')
plt.title("Cumulative sum of messages according to the {} stations \n".format(df_mess_train.bsid.nunique()), size=12)
plt.show()
print("Number of stations under a cumulative sum of {:d} : {:d}".format(mes_limit, (count_basestation_cum < mes_limit).sum()))
We can therefore see that 95 stations receive few messages, while the others receive significantly more.
We therefore decide to remove the stations receiving few messages:
# Removing the corresponding messages
bsid_to_remove = count_basestation_cum[count_basestation_cum < mes_limit].index.values
df_mess_train = df_mess_train[~df_mess_train.bsid.isin(bsid_to_remove)]
# Reset index
df_mess_train = df_mess_train.reset_index().drop(columns=['index'])
n_train = df_mess_train.shape[0]
df_mess_train.shape
List the remaining stations:
listOfBs = df_mess_train.bsid.unique()
listNameBs = ["bs"+str(code) for code in listOfBs]
print("Number of remaining stations : ", len(listOfBs))
One hot encoding:
ohe = OneHotEncoder()
X_bsid = ohe.fit_transform(df_mess_train[['bsid']]).toarray()
# OneHotEncoder sorts its categories, so derive the column names from it
listNameBs = ["bs" + str(c) for c in ohe.categories_[0]]
df_bsid_train = pd.DataFrame(X_bsid[:n_train,:], columns = listNameBs)
# We add the columns from our encoder to our training dataset
df_mess_train[listNameBs] = df_bsid_train
We keep the did associated with each message, to be able to use it in the final dataset.
listOfDid = df_mess_train.did.unique()
listNameDid = ["did"+str(int(i)) for i in listOfDid]
print("Number of remaining devices: ", len(listOfDid))
df_grouped_train_did = df_mess_train.groupby(['messid', 'did']).count().reset_index(level=['did'])["did"]
messid
df_grouped_train = df_mess_train.groupby(['messid'])
bsid by messid
We retrieve all the stations affected by a given messid:
df_bsid_grouped_train = df_grouped_train.sum()[listNameBs]
df_bsid_grouped_train.head()
We count, for a random message, how many stations have received it:
df_bsid_grouped_train.loc[['573bf1d9864fce1a9af8c5c9']].sum(axis=1)
bsid by messid
count_bsid_grouped_train = df_bsid_grouped_train.sum(axis=1).values
print('min = ', count_bsid_grouped_train.min(), '; max = ', count_bsid_grouped_train.max())
Device id by messid. We take the mean for convenience; since each message has a single did, the mean is just that did.
did_grouped_train = df_grouped_train.mean()['did'].values
did_grouped_train
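As noted above, the mean of a column that is constant within each group is just that constant; a quick toy check:

```python
import pandas as pd

# Each message has a single device id, repeated across its receptions
toy = pd.DataFrame({"messid": ["m1", "m1", "m2"], "did": [7, 7, 3]})
did_per_msg = toy.groupby("messid")["did"].mean()
print(did_per_msg.tolist())  # → [7.0, 3.0]
```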
rssi_grouped_train = df_grouped_train.mean()['rssi'].values
len(rssi_grouped_train)
time_ux_grouped_train = df_grouped_train.mean()['time_ux'].values
lat_grouped_train = df_grouped_train.mean()['bs_lat'].values
lng_grouped_train = df_grouped_train.mean()['bs_lng'].values
# Weighted by the RSSI, the strength of the received signal
lat_rssi_grouped_train = np.array([np.average(elmt['bs_lat'], weights=elmt['rssi']) for key, elmt in df_grouped_train])
lng_rssi_grouped_train = np.array([np.average(elmt['bs_lng'], weights=elmt['rssi']) for key, elmt in df_grouped_train])
# Weighted by time_ux
time_ux_lat_grouped_train = np.array([np.average(elmt['bs_lat'], weights=elmt['time_ux']) for key, elmt in df_grouped_train])
time_ux_lng_grouped_train = np.array([np.average(elmt['bs_lng'], weights=elmt['time_ux']) for key, elmt in df_grouped_train])
# Weighted by nseq
nseq_lat_grouped_train = np.array([np.average(elmt['bs_lat'], weights=elmt['nseq']+1) for key, elmt in df_grouped_train])
nseq_lng_grouped_train = np.array([np.average(elmt['bs_lng'], weights=elmt['nseq']+1) for key, elmt in df_grouped_train])
# Groupby label
pos_lat_grouped_train = df_grouped_train.mean()['pos_lat'].values
pos_lng_grouped_train = df_grouped_train.mean()['pos_lng'].values
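A caveat on the rssi-weighted averages above: np.average accepts the negative rssi values as weights, and since they all share the same sign the result is a valid weighted mean, but the more negative (weaker) signals then carry the larger weight magnitudes, so weak receptions dominate rather than strong ones. A small sketch with made-up values (the shifted-weight variant is one possible alternative, not what the notebook uses):

```python
import numpy as np

lats = np.array([48.80, 48.90])
rssi = np.array([-100.0, -130.0])   # the first station has the STRONGER signal

# Weighting by raw (negative) rssi pulls the estimate toward the WEAKER station
print(np.average(lats, weights=rssi))                   # closer to 48.90
# Shifting the weights so stronger signals dominate instead
print(np.average(lats, weights=rssi - rssi.min() + 1))  # closer to 48.80
```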
We build a dataframe from the groupby on messid that we just computed, averaging the different variables for a same message.
# We create the dataframe, with the features we want to add
df_train = pd.DataFrame()
df_train["did"] = df_grouped_train_did
df_train['mean_rssi'] = rssi_grouped_train
df_train['mean_lat'] = lat_grouped_train
df_train['mean_lng'] = lng_grouped_train
df_train['mean_lat_rssi'] = lat_rssi_grouped_train
df_train['mean_lng_rssi'] = lng_rssi_grouped_train
df_train['mean_time_ux'] = time_ux_grouped_train
df_train[listNameBs] = df_bsid_grouped_train
df_train['pos_lat'] = pos_lat_grouped_train
df_train['pos_lng'] = pos_lng_grouped_train
df_train
We will see which features are most important on a RandomForestRegressor model. This will allow us to fine-tune the selection of our variables and improve the training performance of our model.
X_train = df_train.iloc[:,:-2]
y_lat_train = df_train['pos_lat']
y_lng_train = df_train['pos_lng']
We fit the RandomForest:
clf_lat = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=261)
clf_lat.fit(X_train, y_lat_train)
clf_lng = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=261)
clf_lng.fit(X_train, y_lng_train)
Calculation of feature importance for latitude and longitude:
dict_feature_importance_lat = {'feature': X_train.columns.values,
'importance': clf_lat.feature_importances_}
feature_importances_lat = pd.DataFrame(data=dict_feature_importance_lat).sort_values('importance', ascending=False)
dict_feature_importance_lng = {'feature': X_train.columns.values,
'importance': clf_lng.feature_importances_}
feature_importances_lng = pd.DataFrame(data=dict_feature_importance_lng).sort_values('importance', ascending=False)
feature_importances_lat
feature_importances_lng
importance_treshold = 0.000025
Result for the prediction of the latitude:
mask_lat = feature_importances_lat['importance'] > importance_treshold
plt.figure(figsize=(12,12))
plt.barh(feature_importances_lat['feature'][mask_lat], feature_importances_lat['importance'][mask_lat])
plt.title('Feature importance for latitude')
plt.tight_layout()
plt.gca().invert_yaxis()
plt.show()
We do the same for the longitude:
mask_lng = feature_importances_lng['importance'] > importance_treshold
plt.figure(figsize=(12,12))
plt.barh(feature_importances_lng['feature'][mask_lng], feature_importances_lng['importance'][mask_lng])
plt.title('Feature importance for longitude')
plt.tight_layout()
plt.gca().invert_yaxis()
plt.show()
It is therefore quite clear that only a few variables dominate the others. We therefore remove from our training set the features whose importance is below the threshold of 0.000025.
mask_lat = feature_importances_lat['importance'] > importance_treshold
mask_lng = feature_importances_lng['importance'] > importance_treshold
# We use set() to have the intersection of our ensembles
indexes_to_remove = list(set(feature_importances_lat['feature'][np.logical_not(mask_lat)]
).intersection(set(feature_importances_lng['feature'][np.logical_not(mask_lng)])))
print("{:d} features have an importance lower than {:.6f}.".format(len(indexes_to_remove), importance_treshold))
X_train = X_train.drop(indexes_to_remove, axis=1)
# Evaluation function of our results
def vincenty_vec(vec_coord):
    vin_vec_dist = np.zeros(vec_coord.shape[0])
    if vec_coord.shape[1] != 4:
        print('ERROR: Bad number of columns (shall be = 4)')
    else:
        vin_vec_dist = [vincenty(vec_coord[m, 0:2], vec_coord[m, 2:]).meters for m in range(vec_coord.shape[0])]
    return vin_vec_dist
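Note that `vincenty` was removed in geopy 2.0 in favor of `geopy.distance.geodesic` (same `.meters` attribute). If neither is available, a pure-Python haversine distance is a slightly less accurate stand-in (it assumes a spherical Earth):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_meters(p1, p2, r=6371008.8):
    """Great-circle distance in meters between two (lat, lng) pairs,
    using the mean Earth radius r."""
    lat1, lng1, lat2, lng2 = map(radians, (*p1, *p2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * r * asin(sqrt(a))

# One degree of latitude along a meridian is about 111.2 km
print(haversine_meters((0.0, 0.0), (1.0, 0.0)))
```

For the few-kilometer errors measured here, the difference between haversine and Vincenty is well below the model error.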
# Evaluate distance error for each predicted point
def Eval_geoloc(y_train_lat, y_train_lng, y_pred_lat, y_pred_lng):
    vec_coord = np.array([y_train_lat, y_train_lng, y_pred_lat, y_pred_lng])
    err_vec = vincenty_vec(np.transpose(vec_coord))
    return err_vec
# Display of cumulative error
def grap_error(err_vec):
    # Plot the cumulative distance-error curve
    values, base = np.histogram(err_vec, bins=50000)
    cumulative = np.cumsum(values)
    plt.figure()
    plt.plot(base[:-1]/1000, cumulative / float(np.sum(values)) * 100.0, c='blue')
    plt.grid()
    plt.xlabel('Distance Error (km)')
    plt.ylabel('Cum proba (%)'); plt.axis([0, 30, 0, 100])
    plt.title('Error Cumulative Probability')
    plt.legend(["Opt LLR", "LLR 95", "LLR 99"])
    plt.show()
RandomForestRegressor
We will now optimize our RandomForest algorithm. To do this, we tune the maximum tree depth max_depth, the proportion of features considered at each split max_features, and the number of trees n_estimators.
# We prepare our dataframes to perform a gridSearch
Xtrain_cv, Xtest_cv, y_lat_train_cv, y_lat_test_cv, y_lng_train_cv, y_lng_test_cv = \
train_test_split(X_train, y_lat_train, y_lng_train, test_size=0.2, random_state=261)
We perform a manual grid search, rather than using a prebuilt function, because it is better suited to optimizing our two models jointly.
# Manual grid search
list_max_depth = [20, 25, 30, 35, 40, 45, 50]
list_max_features = [0.5, 0.6, 0.7, 0.8, 0.9, None]
list_n_estimators = [50, 100, 200]
err80 = 10000
list_result = []
for max_depth in list_max_depth:
    print('Step max_depth : ', str(max_depth))
    for max_features in list_max_features:
        for n_estimators in list_n_estimators:
            clf_rf_lat = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth,
                                               max_features=max_features, n_jobs=-1)
            clf_rf_lat.fit(Xtrain_cv, y_lat_train_cv)
            y_pred_lat = clf_rf_lat.predict(Xtest_cv)
            clf_rf_lng = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth,
                                               max_features=max_features, n_jobs=-1)
            clf_rf_lng.fit(Xtrain_cv, y_lng_train_cv)
            y_pred_lng = clf_rf_lng.predict(Xtest_cv)
            err_vec = Eval_geoloc(y_lat_test_cv, y_lng_test_cv, y_pred_lat, y_pred_lng)
            perc = np.percentile(err_vec, 80)
            list_result.append((max_depth, max_features, n_estimators, perc))
            if perc < err80:  # minimum error distance for 80% of the observations
                err80 = perc
                best_max_depth = max_depth
                best_max_features = max_features
                best_n_estimators = n_estimators
print('--- Final results ---')
print('\nbest_max_depth', best_max_depth)
print('best_max_features', best_max_features)
print('best_n_estimators', best_n_estimators)
print('err80', err80)
We train our RandomForest model on 80% of the train set and validate it on the remaining 20%. The model is trained with the best hyperparameters we just found.
clf_rf_lat = RandomForestRegressor(n_estimators = best_n_estimators,
max_features = best_max_features,
max_depth = best_max_depth, n_jobs=-1)
clf_rf_lat.fit(Xtrain_cv, y_lat_train_cv)
y_pred_lat = clf_rf_lat.predict(Xtest_cv)
clf_rf_lng = RandomForestRegressor(n_estimators = best_n_estimators,
max_features = best_max_features,
max_depth = best_max_depth, n_jobs=-1)
clf_rf_lng.fit(Xtrain_cv, y_lng_train_cv)
y_pred_lng = clf_rf_lng.predict(Xtest_cv)
err_vec = Eval_geoloc(y_lat_test_cv , y_lng_test_cv, y_pred_lat, y_pred_lng)
print("Cumulative distance error at 80% : {} \n" .format((np.percentile(err_vec, 80))))
grap_error(err_vec)
ExtraTreesRegressor
We test another ensemble algorithm, with a new grid search.
# Manual grid search
list_max_depth = [20, 25, 30, 35, 40, 45, 50, 55, 60]
list_max_features = [0.5, 0.6, 0.7, 0.8, 0.9, None]
list_n_estimators = [50, 100, 200]
err80 = 10000
list_result = []
for max_depth in list_max_depth:
    print('\nStep max_depth : ' + str(max_depth))
    for max_features in list_max_features:
        for n_estimators in list_n_estimators:
            clf_et_lat = ExtraTreesRegressor(n_estimators=n_estimators, max_depth=max_depth,
                                             max_features=max_features, n_jobs=-1)
            clf_et_lat.fit(Xtrain_cv, y_lat_train_cv)
            y_pred_lat = clf_et_lat.predict(Xtest_cv)
            clf_et_lng = ExtraTreesRegressor(n_estimators=n_estimators, max_depth=max_depth,
                                             max_features=max_features, n_jobs=-1)
            clf_et_lng.fit(Xtrain_cv, y_lng_train_cv)
            y_pred_lng = clf_et_lng.predict(Xtest_cv)
            err_vec = Eval_geoloc(y_lat_test_cv, y_lng_test_cv, y_pred_lat, y_pred_lng)
            perc = np.percentile(err_vec, 80)
            list_result.append((max_depth, max_features, n_estimators, perc))
            if perc < err80:  # minimum error distance for 80% of the observations
                err80 = perc
                best_max_depth = max_depth
                best_max_features = max_features
                best_n_estimators = n_estimators
print('--- Final results ---')
print('\nbest_max_depth', best_max_depth)
print('best_max_features', best_max_features)
print('best_n_estimators', best_n_estimators)
print('err80', err80)
We then train and validate the model:
clf_lat = ExtraTreesRegressor(n_estimators=best_n_estimators, max_features=best_max_features,
max_depth=best_max_depth, n_jobs=-1)
clf_lat.fit(Xtrain_cv, y_lat_train_cv)
y_pred_lat = clf_lat.predict(Xtest_cv)
clf_lng = ExtraTreesRegressor(n_estimators=best_n_estimators, max_features=best_max_features,
max_depth=best_max_depth, n_jobs=-1)
clf_lng.fit(Xtrain_cv, y_lng_train_cv)
y_pred_lng = clf_lng.predict(Xtest_cv)
err_vec = Eval_geoloc(y_lat_test_cv , y_lng_test_cv, y_pred_lat, y_pred_lng)
print("Cumulative distance error at 80% : {} \n" .format((np.percentile(err_vec, 80))))
grap_error(err_vec)
The scores are very close (slightly better with the ExtraTreesRegressor).
In this part we implement a leave-one-group-out cross-validation, where each group contains all the messages of one device. This is necessary to obtain a more realistic prediction score for the production setting, where the model will face never-seen devices. At each iteration, the model is trained on the whole training set deprived of all the messages of one device (a different device at each iteration).
We will perform the cross validation on both models because their scores were close.
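The grouping behaviour can be illustrated on toy arrays (not the real dataset): with LeaveOneGroupOut, each split holds out every sample of exactly one device.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(12).reshape(6, 2)
y = np.arange(6)
groups = np.array([1, 1, 2, 2, 3, 3])  # toy device ids

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups):
    # No device appears on both sides of a split
    print(groups[test_idx])
```

With three distinct groups, this yields three splits, each testing on one whole device.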
We create the groups, defined by the device ids
groups = np.array(X_train["did"].tolist())
groups
from sklearn.model_selection import LeaveOneGroupOut
logo = LeaveOneGroupOut()
logo.get_n_splits(X_train, y_lat_train, groups)
RandomForestRegressor
We perform the cross validation for the latitude and longitude:
cv_lat = logo.split(X_train, y_lat_train, groups)
y_pred_lat = cross_val_predict(clf_rf_lat, X_train, y_lat_train, cv=cv_lat)
cv_lng = logo.split(X_train, y_lng_train, groups)
y_pred_lng = cross_val_predict(clf_rf_lng, X_train, y_lng_train, cv=cv_lng)
We display the error curve:
err_vec = Eval_geoloc(y_lat_train , y_lng_train, y_pred_lat, y_pred_lng)
print("Cumulative distance error at 80% : {} \n" .format((np.percentile(err_vec, 80))))
grap_error(err_vec)
ExtraTreesRegressor
We do the same with this model:
cv_lat = logo.split(X_train, y_lat_train, groups)
y_pred_lat = cross_val_predict(clf_lat, X_train, y_lat_train, cv=cv_lat)
cv_lng = logo.split(X_train, y_lng_train, groups)
y_pred_lng = cross_val_predict(clf_lng, X_train, y_lng_train, cv=cv_lng)
err_vec = Eval_geoloc(y_lat_train , y_lng_train, y_pred_lat, y_pred_lng)
print("Cumulative distance error at 80% : {} \n" .format((np.percentile(err_vec, 80))))
grap_error(err_vec)
The score is much worse under leave-one-device-out cross-validation, because predictions are made for devices the model has never seen. This is a more realistic score.
The scores below are the distance error at the 80th percentile (err80, in meters) for each outlier-handling option:

| | Outliers deleted | Outliers replaced by averaging | Outliers replaced by random forest |
|---|---|---|---|
| RandomForestRegressor | 2679 | 2691 | 2645 |
| ExtraTreesRegressor | 2669 | 2663 | 2544 |
| RandomForestRegressor cross-validation | 6271 | -- | 5625 |
| ExtraTreesRegressor cross-validation | 5276 | 5370 | 5525 |